We, Mahima Mehandiratta (0801962), Faizal Shaikh (0816124), and Asbin Ghimire (0803803), hereby state that we have not communicated with or gained information in any way from any person or resource that would violate the College’s academic integrity policies, and that all work presented is our own. In addition, we also agree not to share our work in any way, before or after submission, that would violate the College’s academic integrity policies.
R version used: R version 4.2.1 (2022-06-23 ucrt)
RStudio version used: 2022.07.1
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at C:/Users/hp/OneDrive/Documents
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.2
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library("plotrix")
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Student_data <- read.csv("C:/Users/hp/OneDrive/Documents/DAB501/StudentsPerformance.csv")
Student_data
ggplot(data=Student_data)+
geom_histogram(mapping=aes(x=reading.score),
fill='skyblue', colour='black', binwidth= 2) +
labs(x= "Reading Score", y = "COUNT", title = "Students Reading Score")
ggplotly(Student_data %>%
ggplot(mapping = aes(math.score))+
geom_histogram(binwidth = 10, fill = 'skyblue', color = 'black')+
labs(x = "Score in Mathematics", y = "Count of Students", title = "Score of students in Mathematics"))
ggplot(Student_data) +
geom_histogram(mapping = aes(x = writing.score), fill = 'skyblue', colour = 'black',
bins = 15, na.rm = TRUE) +
labs(x= "writing score", y = "Count", title = "Students Writing Score")
ggplot(Student_data, aes(reading.score)) +
geom_boxplot(fill="skyblue") +
labs( title= "Reading score of Students",
x = "Reading Score", y = NULL)
Ans2: A boxplot was used to identify outliers in the reading.score variable. Students with marks below approximately 35 appear as outliers in the boxplot. We cannot ignore these outliers, since the students with low marks are needed for a proper analysis of reading.score. We therefore keep the outliers in the data.
ggplotly(Student_data %>%
ggplot(mapping = aes(y = math.score))+
geom_boxplot(fill = "skyblue")+
labs(y = "Score in Mathematics", title = "Checking outliers in the dataset"))
Ans2: Outliers are data points that differ markedly from most values in the dataset. The boxplot clearly shows a few outliers in the dataset. There are several ways to handle them. One is the interquartile range (IQR) rule: any value greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR is considered a potential outlier. Another is percentile capping: a value above the 99th percentile is replaced with the 99th percentile, and a value below the 1st percentile is replaced with the 1st percentile. We can also replace an outlier with a less extreme value closer to the median.
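The IQR rule and percentile capping described above can be sketched in base R. This is a minimal illustration, not the analysis performed in this report: the `scores` vector is hypothetical and is not taken from StudentsPerformance.csv, and the variable names are chosen only for the example.

```r
# Hypothetical scores, not taken from StudentsPerformance.csv
scores <- c(0, 8, 45, 52, 58, 61, 64, 66, 69, 72, 75, 79, 84, 90, 100)

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(scores, probs = c(0.25, 0.75), names = FALSE)
iqr_value <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr_value
upper_fence <- q[2] + 1.5 * iqr_value
potential_outliers <- scores[scores < lower_fence | scores > upper_fence]

# Percentile capping: pull extreme values back to the 1st and 99th percentiles
caps <- quantile(scores, probs = c(0.01, 0.99), names = FALSE)
capped_scores <- pmin(pmax(scores, caps[1]), caps[2])

print(potential_outliers)
print(range(capped_scores))
```

Capping keeps every observation in the data while limiting the influence of the extremes, which is why it is sometimes preferred to dropping rows.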
ggplot(data = Student_data, mapping = aes(x = writing.score)) +
geom_boxplot(fill= 'skyblue')+
labs(x= "students writing score outliers")
Ans2: A boxplot was used to identify outliers in the reading.score variable. Students with marks below approximately 35 appear as outliers in the boxplot. We cannot ignore these outliers, since the students with low marks are needed for a proper analysis of reading.score. We therefore keep the outliers in the data.
ggplot(data=Student_data)+
geom_histogram(mapping=aes(x=reading.score), fill='skyblue',
colour='black',binwidth= 2)+
labs(x= "Reading Score", y = "COUNT", title = "Students Reading Score")
Ans3: The distribution has a very slight left skew, as the reading scores are concentrated on the right, and it shows two peaks. It is therefore slightly left-skewed and bimodal.
ggplotly(Student_data %>%
ggplot(mapping = aes(math.score))+
geom_histogram(binwidth = 10, fill = 'skyblue', color = 'black')+
labs(x = "Score in Mathematics", y = "Count of Students", title = "Score of students in Mathematics"))
Ans3: The histogram of students’ scores in Mathematics shows that the distribution is unimodal and left-skewed.
ggplot(Student_data) +
geom_histogram(mapping = aes(x = writing.score), fill = 'skyblue', colour = 'black',
bins = 15, na.rm = TRUE) +
labs(x= "writing score", y = "Count", title = "Students Writing Score")
Shape: Observing the box plot, the distribution can be considered unimodal. Skewness: The box plot shows that the distribution is slightly left-skewed, and the median lies near the center.
p1 <-ggplot(data=Student_data)+geom_histogram(mapping=aes(x=reading.score),
fill='skyblue', colour='black', binwidth= 2, na.rm = TRUE)
p2 <- p1 + scale_x_log10()
p3 <- p1 +scale_x_sqrt()
grid.arrange(p1,p2, p3, ncol= 3)
Ans4: As seen above, a logarithmic or square root transformation is not required for our variable. Such transformations are meant for features with a very long tail. They would be useful if the marks varied widely, for example into the thousands, which is not possible in our data set because the marks are out of 100. The trends and variation are already clearly visible in our data, so there is no need for a transformation.
Ans4: Logarithmic or square root transformations are applied where the distribution has long tails. It is not necessary to apply a transformation here, as the distribution is not extremely skewed. When a distribution is extremely skewed, transforming it makes it easier to model because the outliers become less influential. A log transformation also cannot be applied directly here, because the data contain some 0 values whose logarithm is negative infinity, which makes the transformed values meaningless and the analysis difficult to interpret.
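The problem with zeros under a log transformation can be checked directly. The sketch below uses a hypothetical `scores` vector that contains a zero. Shifting by 1 before taking the log is shown only as one common workaround, not as something applied in this report.

```r
# Hypothetical scores that include a zero
scores <- c(0, 25, 50, 75, 100)

log_scores  <- log10(scores)      # first element is -Inf because log10(0) is -Inf
sqrt_scores <- sqrt(scores)       # defined at zero, so every value stays finite
shifted_log <- log10(scores + 1)  # shifting by 1 keeps every value finite

print(is.finite(log_scores))
print(is.finite(sqrt_scores))
print(is.finite(shifted_log))
```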
Ans4: A transformation is not needed for this distribution: as the graph above shows, both the original and the log-transformed distributions look approximately normal.
median(Student_data$reading.score)
## [1] 70
median(Student_data$math.score)
## [1] 66
median(Student_data$writing.score)
## [1] 69
P <- median(Student_data$reading.score)
ggplot(Student_data)+
geom_histogram(mapping=aes(x=reading.score),
fill='skyblue', colour='black',binwidth= 2, na.rm = TRUE) +
geom_vline(xintercept = P, color = 'red', size = 1.5)+
labs(title= "Distribution of Reading Score",x="Reading Score", y= 'COUNT')
Ans6: It is better to choose the median as the measure of central tendency, since it is largely unaffected by outliers and is less likely to be affected by a skewed distribution, whereas the mean is highly affected by outliers.
ggplotly(Student_data %>%
ggplot(mapping = aes(math.score))+
geom_histogram(binwidth = 10, fill = 'skyblue', color = 'black')+
geom_vline(xintercept = median(Student_data$math.score), color = 'red', size = 1.5)+
labs(x = "Score in Mathematics", y = "Count of Students", title = "Score of Students in Mathematics"))
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
Ans6: There are several measures of central tendency, such as the mean, median, and mode. We chose the median because it is much more robust against outliers than the mean. Also, since the distribution of students’ math scores is slightly left-skewed, the median is the better choice for the center of the data.
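The robustness of the median can be illustrated with a small example. The values below are hypothetical and chosen only to make the arithmetic easy to follow. Adding one extreme low score moves the mean much further than it moves the median.

```r
# Hypothetical scores, chosen for easy arithmetic
scores <- c(55, 60, 65, 70, 75)
scores_with_outlier <- c(scores, 0)   # add a single extreme low value

mean_shift   <- mean(scores_with_outlier) - mean(scores)
median_shift <- median(scores_with_outlier) - median(scores)

print(c(mean = mean(scores), median = median(scores)))
print(c(mean = mean(scores_with_outlier), median = median(scores_with_outlier)))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```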
Ans6: We used the median as the measure of central tendency. Although the skewness in the above distribution is slight, the median is a robust statistic and gives a central value however the data are distributed. In contrast to the mean, the median stays close to its previous value even after some unusual values are added to the data set. As a result, adding or removing a few very large or very small values does not significantly change the median.
IQR(Student_data$reading.score)
## [1] 20
Ans7: The interquartile range (IQR) is an appropriate measure of spread to pair with the measure of central tendency chosen above, i.e., the median. In a skewed data set, the quartiles are much less affected by outliers than other measures such as the standard deviation. Therefore, the quartiles are the best choice for measuring the spread of the data alongside the median when we have a skewed data set with a few outliers. The IQR is calculated as the difference between the quartiles Q3 and Q1 and gives the range in which the middle 50% of the values of the variable lie.
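As a rough check of the claim above, the IQR can be computed by hand as Q3 - Q1 and compared with the standard deviation when one extreme value is added. The `scores` vector is hypothetical, and the comparison uses relative change so that the two measures are on the same footing.

```r
# Hypothetical scores
scores <- c(50, 55, 60, 65, 70, 75, 80)

# IQR by hand: difference between the third and first quartiles
q <- quantile(scores, probs = c(0.25, 0.75), names = FALSE)
manual_iqr <- q[2] - q[1]

# Add one extreme value and compare how much each measure of spread changes
scores_extreme <- c(scores, 0)
iqr_change <- (IQR(scores_extreme) - IQR(scores)) / IQR(scores)
sd_change  <- (sd(scores_extreme) - sd(scores)) / sd(scores)

print(c(manual_iqr = manual_iqr, built_in_iqr = IQR(scores)))
print(c(iqr_relative_change = iqr_change, sd_relative_change = sd_change))
```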
iqr <- IQR(Student_data$math.score)
iqr
## [1] 20
Ans7: Since this variable has a skewed distribution and contains outliers, the interquartile range (IQR) is the ideal measure of spread, although the IQR can also be used to describe spread in other situations. The IQR is preferred because both the median and the IQR are less susceptible to skewness and outliers.
ggplot(data=Student_data)+
geom_bar(mapping=aes(x=parental.level.of.education),
fill='skyblue', colour='black') +
labs(title="Count of Parental Level Of Education",
x= "Parental Level Of Education", y="COUNT")
ggplotly(Student_data %>%
ggplot(mapping = aes(x = race.ethnicity)) +
geom_bar(color = "black",fill = "skyblue")+
labs(title="Count of Different Ethnicities",
y = "Count", x = "Race/Ethnicity"))
count_lunch <- Student_data %>% group_by(lunch) %>% count()
ggplot(count_lunch, aes(x = lunch, y = n)) +
geom_bar(fill = "skyblue", color='black', stat = "identity") +
geom_text(aes(label = n), vjust = -0.3)+
labs(title = "Bar Plot Representing lunch column", x = "Lunch Type", y = "Count")
data <- table(Student_data$parental.level.of.education)
labels <- paste(names(data), data)
labper <- paste0(names(data), " (" ,round(100 * data/sum(data),2), "%)")
pie3D(data, labels= labper, main= "Proportion of parental level of education")
data <- table(Student_data$race.ethnicity)
labper <- paste0(names(data), " (", round(100 * data/sum(data),2), "%)")
pie3D(data, labels = labper, explode = 0.1, main = "Proportion of different race/ethnicities")
ggplot(Student_data)+
geom_bar(aes(x = lunch, y = after_stat(prop), group = 1), fill = "skyblue", color = 'black')+
labs(title="Bar Chart",
subtitle = "Chart to represent proportion of lunch variable",
x="Lunch Type",
y="Value in Proportion")
Ans3: There is some disparity in the distribution of the parental level of education variable. The categories are unequally represented: master’s degree accounts for only 5.9%, whereas some college and associate’s degree each account for approximately 22%.
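Percentages like the ones quoted above can be read off a frequency table with `prop.table()`. The sketch below uses a small hypothetical `education` vector rather than the real column, so its numbers do not match the report.

```r
# Hypothetical category labels, not the real parental.level.of.education column
education <- rep(
  c("some college", "associate's degree", "high school", "master's degree"),
  times = c(4, 4, 3, 1)
)

counts <- table(education)
percentages <- round(100 * prop.table(counts), 1)  # share of each category in percent

print(counts)
print(percentages)
```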
Student_data %>% group_by(race.ethnicity) %>%
summarize(n = n()) %>%
arrange(n)
Ans3: An unusual observation about the variable may be that there are more than 3 times as many students in group C of race/ethnicity as in group A.
Ans3: Although there is no unusual observation, we can note that the standard lunch category is almost double the size of the free/reduced category.
unique(Student_data$parental.level.of.education)
## [1] "bachelor's degree" "some college" "master's degree"
## [4] "associate's degree" "high school" "some high school"
Ans4: The variable parental.level.of.education has 6 unique values: “bachelor’s degree”, “some college”, “master’s degree”, “associate’s degree”, “high school”, and “some high school”.
Ans4:
unique(Student_data$race.ethnicity)
## [1] "group B" "group C" "group A" "group D" "group E"
Ans4: There are only 5 unique values: group A, group B, group C, group D, and group E.
unique(Student_data$lunch)
## [1] "standard" "free/reduced"
This categorical variable has just two unique values, i.e., standard and free/reduced.
Numeric and Numeric
Student_data %>% ggplot(mapping=aes(x=writing.score, y = math.score))+
geom_point(color= "darkblue") +
geom_smooth(color= "red", method="lm")+
labs(x = "Writing Score",
y = "Math Score",
title = "Writing Score vs Math Score")
## `geom_smooth()` using formula 'y ~ x'
ggplotly(Student_data %>%
ggplot(aes(x = reading.score,
y = math.score)) +
geom_point(color='darkblue')+
geom_smooth(formula = y ~ x,
method = "gam",color='red')+
labs(title="Relationship between scores in Reading and Mathematics",
y = "Score in Mathematics", x = "Score in Reading"))
ggplot(Student_data,
aes(x = reading.score,
y = writing.score)) +
geom_point(color= "darkblue") +
geom_smooth(method = "lm",color='red')+
labs(x = "reading score",
y = "writing score",
title = "reading score vs writing score")
## `geom_smooth()` using formula 'y ~ x'
Student_data %>% ggplot(mapping=aes(x=reading.score, y = parental.level.of.education,
fill= parental.level.of.education))+
geom_boxplot() +
labs(x = "Reading Score",
y = "Parental Level Of Education",
title = "Reading Score by Parental Level Of Education")
ggplot(data=Student_data,aes(x=parental.level.of.education,y=writing.score,
fill=parental.level.of.education))+
geom_violin(col="blue")+
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Relationship between students' parental education and\ntheir writing score",
caption="Data source: Students Performance in Exams",
x="Parental Level Of Education",y="Writing Score")
ggplotly(Student_data %>%
ggplot(aes(x = race.ethnicity,
y = math.score, fill = race.ethnicity)) +
geom_boxplot()+
labs(title="Ethnicity vs Mathematics Score",
y = "Score in Mathematics", x = "Race/Ethnicity"))
Ans2: FORM: The form of the observed relationship between writing score and math score is linear, which suggests that students with a low score in maths also have a low score in writing, and students with a high score in maths also have a high score in writing. DIRECTION: The direction of the relationship between the two variables is positive, since the line fitted with geom_smooth() slopes upward through the points. STRENGTH: The relationship is strong: the correlation function returns 0.802642, which is close to 1 and indicates a strong positive correlation.
cor(Student_data$writing.score, Student_data$math.score)
## [1] 0.802642
Ans2: Form, direction, and strength can be interpreted using the scatter plot. Form tells us whether the relationship is linear or not. Adding geom_smooth() to the scatter plot of the numerical variables shows that the relationship is linear. Direction tells us whether the relationship is positive or negative. The plot shows that it is positive. Strength tells us whether the relationship between the variables is strong, fairly strong, or weak. The plot as well as the correlation coefficient (0.8175797) shows that there is a very strong relationship between the variables.
cor(Student_data$reading.score, Student_data$math.score)
## [1] 0.8175797
Ans2: The relationship between the two variables (reading and writing score) is linear: the data points lie close together and a straight line passes through the cloud of points. The values on the y-axis rise as the values on the x-axis rise, so the two variables increase together. Most of the points lie close to the line and few are far from it, indicating a positive and quite strong relationship. To quantify the relationship, the correlation coefficient between the two variables is computed as follows. It is close to 1, which confirms a strong positive linear relationship.
cor(Student_data$reading.score,Student_data$writing.score)
## [1] 0.9545981
Ans3: The strong positive linear relationship suggests that students with low marks in writing had low marks in maths too, students with average marks in maths did average in writing, and students with good marks in writing had good marks in maths. Overall, the data suggest that students are unlikely to do well in writing and badly in maths, or vice versa.
Ans3: The correlation coefficient tells us about the strength of the linear relationship between the two variables. Its value lies between -1 and 1. A value of 1 indicates a perfect positive relationship, i.e., both variables move in the same direction. A value of -1 indicates a perfect negative relationship, i.e., the variables move in opposite directions. A value of 0 indicates no linear relationship between the two variables. The scatter plot shows that as students’ scores in Mathematics increase, their scores in Reading also increase. There are a few outliers, but for most values the trend is linear and represents a strong relationship between the variables.
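The three reference values of the correlation coefficient can be reproduced with toy data. The vectors below are hypothetical. The last case is included to show that a coefficient of 0 rules out only a linear relationship, since `y_curved` depends entirely on `x_sym`.

```r
# Hypothetical data for the three reference cases
x <- 1:10
y_up   <- 2 * x + 5    # rises exactly with x
y_down <- -3 * x + 1   # falls exactly as x rises

x_sym    <- -3:3
y_curved <- x_sym^2    # depends on x_sym, but not linearly

r_up     <- cor(x, y_up)
r_down   <- cor(x, y_down)
r_curved <- cor(x_sym, y_curved)

print(c(r_up = r_up, r_down = r_down, r_curved = r_curved))
```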
Ans3: There do not seem to be any outliers: the writing score climbs along with the reading score, demonstrating the strong association between the two.
Ans4: The pattern observed in the plot is positive and linear, and the correlation of 0.802642 indicates that it is strong. Judging from both the plot and the correlation, it makes sense to say that the variables follow a strong linear correlation in a positive direction.
Ans4: The correlation coefficient calculated above (0.8175797) tells us that the relationship between the two chosen numerical variables (Reading score and Math score) is very strong, as it is close to 1. This means that the two variables move in the same direction. The variability of the data can be measured through the variance, a measure of dispersion that describes how far the values are spread around the mean.
mean(Student_data$math.score)
## [1] 66.089
The variance describes the spread of the values around the mean (66.089), and the scatter plot shows that most of the values are clustered around the mean.
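For reference, the sample variance is the sum of squared deviations from the mean divided by n - 1. The sketch below checks that definition against R's built-in `var()` on a hypothetical `scores` vector. It does not use the math.score column.

```r
# Hypothetical scores
scores <- c(50, 60, 70, 80, 90)

deviations <- scores - mean(scores)                           # distance of each value from the mean
manual_variance <- sum(deviations^2) / (length(scores) - 1)   # sample variance (n - 1 denominator)

print(c(manual = manual_variance, built_in = var(scores)))
print(sqrt(manual_variance))   # standard deviation, in the same units as the scores
```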
ggplotly(Student_data %>%
ggplot(aes(x = reading.score,
y = math.score)) +
geom_point()+
geom_smooth(formula = y ~ x,
method = "gam")+
# math.score is on the y-axis, so its mean is drawn as a horizontal line
geom_hline(yintercept = mean(Student_data$math.score), color = 'red', size = 1.5)+
labs(title="Relationship between scores in Reading and Mathematics",
y = "Score in Mathematics", x = "Score in Reading"))
Ans4: There is a strong association between the two variables, since the correlation coefficient between the reading and writing scores is 0.9546, which is almost 1.